Online Clustering of Linguistic Data

Authors

  • Lev Reyzin
  • Moses Charikar
Abstract

Clustering text data online as it arrives is a difficult problem. It is hard both to capture a meaningful notion of linguistic similarity and to cluster large amounts of data in a single pass. The problem is especially challenging because most known algorithms that ensure tight clusterings are inefficient on large datasets. While significant work has been done on text clustering, the problem has not been fully explored. In this paper, we discuss previous methods in text clustering and then develop a single-pass text clustering algorithm designed specifically for clustering news stories (but more widely applicable), and examine its empirical behavior. We then analyze some of its key design features and compare them to possible alternative methods. Finally, we discuss possibilities for further improvement of our algorithm.

Similar resources

A Linguistic Analysis of the Online Debate on Vaccines and Use of Fora as Information Stations and Confirmation Niche

This study looks at the communication between users concerning health risks, with the aim of exploring their use of fora and assessing whether participants establish a niche with like-minded users during these exchanges. By integrating a corpus linguistic approach with content analysis and multiple studies on computer mediated health discourse, this study analyses the intense attention paid to ...

BotOnus: an online unsupervised method for Botnet detection

Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...

Learning Linguistic Descriptors of User Roles in Online Communities

Understanding the ways in which users interact with different online communities is crucial to social network analysis and community maintenance. We present an unsupervised neural model to learn linguistic descriptors for a user’s behavior over time within an online community. We show that the descriptors learned by our model capture the functional roles that users occupy in communities, in con...

Linguistic variables determination using fuzzy clustering

The fuzzy sets defining the linguistic variable values can be seen as a fuzzy partition of the linguistic variable. The membership functions obtained using fuzzy clustering algorithms are defined with respect to the group prototypes, and they cannot be used to define the linguistic variable values. We introduce several criteria to pass from the clustering membership functions to the linguistic ...

A new method for fuzzification of nested dummy variables by fuzzy clustering membership functions and its application in financial economy

In this study, the aim is to propose a new method for fuzzification of nested dummy variables. The fuzzification idea of dummy variables has been acquired from the non-linear part of regime switching models in econometrics. In these models, the concept of transfer functions is like the notion of fuzzy membership functions, but no principle or linguistic sentence has been used for inputs. Consequen...

Publication date: 2004